Experience Replay Optimization (ERO) masks stored transitions in order to improve sample efficiency. An additional neural network, the “replay policy”, takes features extracted from each transition and infers a mask probability. The agent then samples uniformly from the transitions selected by the masks.
In order to train the replay policy, the binary masks (\( \mathbf{I} \)) drawn from a Bernoulli distribution are treated as its actions. The replay-reward (\( r^{r} \)) is defined as the difference between the cumulative reward of the current agent policy (\( \pi \)) and that of the previous policy (\( \pi^{\prime} \)): \( r^{r} = r^{c}_{\pi} - r^{c}_{\pi^{\prime}} \).
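As a minimal sketch of this sampling flow, the following NumPy snippet uses a toy logistic replay policy over made-up transition features; the names `features`, `w`, and `replay_policy` are illustrative stand-ins, not part of any existing API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 1000 stored transitions described by feature vectors, and a
# logistic "replay policy" mapping features to Bernoulli mask probabilities.
features = rng.normal(size=(1000, 4))
w = rng.normal(size=4)

def replay_policy(f):
    """Hypothetical replay policy: logistic scores over transition features."""
    return 1.0 / (1.0 + np.exp(-f @ w))

# 1. Infer a mask probability for every stored transition.
phi = replay_policy(features)
# 2. Draw binary Bernoulli masks I ~ Bernoulli(phi).
masks = rng.random(phi.shape) < phi
# 3. Sample the training mini-batch uniformly from the transitions kept by the masks.
kept = np.flatnonzero(masks)
batch_idx = rng.choice(kept, size=32)
```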
The policy gradient for a mini-batch can be written as follows:
\[ \sum _{j:B_j \in B^{\text{batch}}} r^{r} \nabla \left[ \mathbf{I}_j \log \phi_j + (1-\mathbf{I}_j) \log (1-\phi_j) \right] \]
where \( \phi_j \) is the mask probability inferred by the replay policy for transition \( B_j \).
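Continuing the toy logistic model above, a REINFORCE-style update of the replay-policy parameters could look like the sketch below. It assumes \( B^{\text{batch}} \) is a randomly chosen subset of stored transitions and that the replay-reward `r_replay` has already been measured from the agent's returns; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic replay policy, as in the previous sketch.
features = rng.normal(size=(1000, 4))                   # f(B_j): transition features
w = rng.normal(size=4)                                  # replay-policy parameters
phi = 1.0 / (1.0 + np.exp(-features @ w))               # inferred mask probabilities
masks = (rng.random(phi.shape) < phi).astype(float)     # Bernoulli actions I_j

batch = rng.choice(features.shape[0], size=32, replace=False)  # assumed B^batch
r_replay = 0.3   # r^r = r^c_pi - r^c_pi', measured elsewhere (placeholder value)

# For a logistic phi, grad_w [ I log(phi) + (1 - I) log(1 - phi) ] = (I - phi) * f,
# so the mini-batch policy gradient is a replay-reward-weighted sum over B^batch.
grad = r_replay * ((masks[batch] - phi[batch])[:, None] * features[batch]).sum(axis=0)

w += 1e-3 * grad   # one gradient-ascent step on the replay policy
```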
We plan to implement something like BernoulliMaskedReplayBuffer to support ERO. Even with such a future enhancement, users would still need to implement the neural network that infers the Bernoulli mask probabilities.
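Purely as an illustration of how such a buffer could fit together, here is a NumPy sketch. The class name comes from the plan above, but the constructor and method signatures are assumptions, not a committed API.

```python
import numpy as np

class BernoulliMaskedReplayBuffer:
    """Hypothetical sketch of the planned buffer; the actual API may differ.

    The buffer only stores transitions and applies externally supplied
    Bernoulli mask probabilities; the network producing those probabilities
    is left to the user.
    """

    def __init__(self, size, obs_dim, rng=None):
        self.size = size
        self.obs = np.zeros((size, obs_dim))
        self.act = np.zeros(size)
        self.rew = np.zeros(size)
        self.next_obs = np.zeros((size, obs_dim))
        self.done = np.zeros(size)
        self.index = 0        # next slot to overwrite (ring buffer)
        self.stored = 0       # number of valid transitions
        self.rng = rng or np.random.default_rng()

    def add(self, obs, act, rew, next_obs, done):
        i = self.index
        self.obs[i], self.act[i], self.rew[i] = obs, act, rew
        self.next_obs[i], self.done[i] = next_obs, done
        self.index = (i + 1) % self.size
        self.stored = min(self.stored + 1, self.size)

    def sample(self, batch_size, mask_prob):
        """Sample uniformly among transitions kept by Bernoulli masks.

        mask_prob: user-inferred mask probabilities, one per stored transition.
        """
        masks = self.rng.random(self.stored) < mask_prob[:self.stored]
        kept = np.flatnonzero(masks)
        if kept.size == 0:                    # fall back to plain uniform sampling
            kept = np.arange(self.stored)
        idx = self.rng.choice(kept, size=batch_size)
        return {"obs": self.obs[idx], "act": self.act[idx], "rew": self.rew[idx],
                "next_obs": self.next_obs[idx], "done": self.done[idx]}
```

In this sketch, `sample()` expects the user-inferred mask probabilities as an argument, which keeps the replay-policy network entirely outside the buffer.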